Introduction

Despite its publication nearly 25 years ago, the controversial book The Bell Curve (TBC)1 has persisted in conversations about IQ. The authors, Richard Herrnstein and Charles Murray, claim that IQ strongly influences many significant life events – like whether someone will enter poverty, drop out of school, get married, or even go to jail. Even more contentious, they extend the consequences of low IQ to demographics like ethnicity and nationality. As evidence, the authors build and discuss dozens of probability models using public data.

    Although TBC has been criticized extensively, these responses have mostly come from social psychologists and not statisticians. For this report, I intend to evaluate the statistical meaning of 24 of the nearly 3 dozen models via bootstrap regression. In doing so, I will illustrate the impact of two scientific philosophies, explanation and prediction, in the context of presenting statistical models to the public. I will describe why prediction is the only philosophy relevant to the claims in TBC and, finally, that the 24 probability models are inadequate in supporting the authors’ public policy recommendations.

Background

The Bell Curve and its Claims

TBC stands out not for being a stuffy academic paper, but instead as accessible prose intended for the general population. The book is political, pop-psychology by design. TBC can be simplified into 3 sections: a summary of prior IQ research, a presentation of new probability models derived by the authors, and a discussion of the findings including recommendations for public policy. With this being an accessible book, the probability models are described in simple, interpretable terms. In fact, TBC includes extensive discussion on what multivariate analysis means – from explaining what a “variable” means in statistics to what logistic regression attempts to model. The authors even include an entire appendix titled, “Statistics for People Who Are Sure They Can’t Learn Statistics”.

    In this format, it’s clear that the authors want to persuade a general audience and leverage the language of statistics to make their public policy recommendations more credible. Among these recommendations, TBC proposes that the US government should:
  1. Discourage births by poor women:

The technically precise description of America’s fertility policy is that it subsidizes births among poor women, who are also disproportionately at the low end of the intelligence distribution. We urge generally that these policies, represented by the extensive network of cash and services for low-income women who have babies, be ended.

  1. Spend more on gifted students and less on disadvantaged students:

Reallocate some portion of existing elementary and secondary school federal aid away from programs for the disadvantaged to programs for the gifted… At present, there is an overwhelming tilt toward enriching the education of children from the low end of the cognitive ability distribution. We propose more of a balance across the cognitive ability distribution.

  1. Shift immigration policy away from family reunification and toward individual cognitive assessments:

The rules that currently govern immigration provide the other major source of dysgenic pressure. It appears that the mean IQ of immigrants in the 1980s works out to about 95. The low IQ may not be a problem: in the past, immigrants have sometimes shown large increases on such measures. But other evidence indicates that the self-selection process that used to attract the classic American immigrant – brave, hard working, imaginative, self-starting, and often of high IQ – has been changing, and with it the nature of some of the immigrant population.

It should be among the goals of public policy to shift the flow of immigrants away from those admitted under the nepotistic rules (which broadly encourage the reunification of relatives) and toward those admitted under competency rules…

    Each one of these claims is dependent on the accuracy about how low IQ impacts various indicators relevant to American society. In their construction and review of new probability models, the authors support their claims using the following indicators (or “binary targets” in the language of statistical modeling):
  • Whether an individual will be under the official poverty line
  • Whether an individual will permanently drop out of high school
  • Whether an individual will receive a GED instead of a high school diploma
  • Whether an individual will receive a bachelor’s degree
  • Whether an individual will be out of the labor force for four weeks or more
  • Whether an individual will be unemployed for four weeks or more
  • Whether an individual will ever marry before the age of 30
  • Whether an individual will be divorce within the first 5 years of marriage
  • Whether an individual will go to jail[include footnote about describing this in more detail later]
  • Whether an individual will have “Middle Class Values” as defined by the authors[list in the footnotes]

Data Used in The Bell Curve

The data used to evaluate the relationship between IQ and each of these binary targets is the National Longitudinal Survey of Youth 1979 (NLSY79)2, which is a public dataset made available by the Bureau of Labor Statistics. In this academic study, 12,686 young Americans (aged 14 – 21 at the survey’s start) of varying backgrounds were interviewed each year between 1979 through 1994. Although the survey continued after this time, the last relevant year for TBC was 1990. These interviews covered a vast number of topics which were tabulated and, more recently, released online.

    Critics of TBC note that the NLSY79 lacks an explicit measure of cognitive ability. In lieu of such a variable, TBC authors used a military achievement test called the Armed Forces Qualifications Test (AFQT). In 1980, the US Department of Defense and Military Services selected the NLSY79 participants to take a battery of tests that measure knowledge and skill across “general science, arithmetic reasoning, work knowledge, paragraph comprehension, numerical operations, coding speed, auto and shop information, mathematics knowledge, mechanical comprehension, and electronic information.”3 Nearly all NLSY79 participants completed the battery. Of these, TBC authors normalized the AFQT score using a revised percentile score from 1989 and labeled it “zAFQT89”.
    The timing of when the NLSY79 participants took the AFQT (i.e., 1980) is important for this report. In TBC, the binary targets listed above are actually defined for a particular time period. The main takeaway is that the events described by these targets occur in the future relative to when IQ is measured:
  • Under the official poverty line in 1989
  • Permanently dropped out of high school (which would be in the future for some NLSY79 participants at the survey’s start)
  • Received a GED instead of a high school diploma (which would be in the future for some NLSY79 participants at the survey’s start)
  • Received a bachelor’s degree (which would be in the future for some NLSY79 participants)
  • Out of the labor force for four weeks or more in 1989
  • Unemployed for four weeks or more in 1989
  • Ever married before the age of 30 (which would be in the future for all NLSY79 participants at the survey’s start)
  • Divorced within the first 5 years of marriage
  • The subject was interviewed in jail at least once from 1979 to 1990
  • Did the subject score ‘yes’ on the Middle Class Values Index (which depends on information recorded in 1989 and 1990)
    To better isolate the effect of IQ in their probability models, TBC authors include two other variables: socioeconomic status and age. For the latter, the authors use a normalized version of age from NLSY79 participants, which they call “zAge”. For the former, they derive a measure of socioeconomic status, normalize it, and call the result “zSES”. In appendix 2, the authors write: “Since the purpose of the index was to measure the socioeconomic environment in which the NLSY youth was raised, the specific variables employed referred to the parents’ status: total net family income, [total years of] mother’s education, [total years of] father’s education, and an index of occupational status of the adults living with the subject at the age of 14.”. This definition is defined in greater detail across another page of the appendix. However, I omit them here, since a secondary goal of this report is to reproduce these probability models as-is.

Reproducing the Models

Because of the book’s controversy and age, much has been said about TBC. This criticism has primarily come from social psychologists and other experts in intelligence. From my perspective, their comments seem embedded in psychology. E.g., from James Heckman’s “Lessons from the Bell Curve”4:

[The derived zAFQT89 field] is not the same as the g that can be extracted from test scores available in their data set… They do not emphasize how little of the variation in social outcomes is explained by ADQT or g.”

AFQT is an achievement test that can be manipulated by educational interventions… A person’s AFQT score is not an immutable characteristic beyond environmental manipulation.

While there is nothing wrong with these responses, I am surprised how few of them review the probabilities models described in TBC. In fact, I have only found one detailed statistical review of a TBC model.

    In 2007, Dr. Claudia Krenz5 posted, in statistical terms, a standard classification analysis of the probability model on poverty using a cutoff value of 50% and the training data to evaluate performance. At a high-level, a classification analysis means that modeled probabilities are converted into a binary field which is then analyzed to evaluate the quality and reliability of the underlying probability model. A cutoff value is typically used to define this conversion, such that a modeled probability above this cutoff value is assumed to be 1. In the context of the poverty model, e.g., NLSY79 participants are assumed to enter poverty when their modeled probabilities are above this cutoff. From Dr. Krenz’s report,

In summary, this web page describes the first logistic regression published in The Bell Curve, in which POVERTY status–either above or below an official level–was predicted by a model consisting of AFQT, SES, and AGE scores from a decade earlier.

I replicated the numbers published in the book’s Appendix 4. I then looked at how well the model fit: it didn’t, not at all in the published sample (N=3367), not much in another independent one (N=1067). I suspect HM didn’t realize they’d made a Type I error, that their model’s predictive accuracy for cases living below the POVERTY level was 0%…

    I found Dr. Krenz’s analysis interesting but thought it could be improved by extending the classification analysis to more TBC models. Namely, this report reviews 24 probability models from TBC. The 10 binary targets are examined on different NLSY79 subsets. E.g., TBC describes two poverty models: one using participants who only graduated high-school and another with participants who graduated either high-school or college. In general, these data subsets are based on education, gender, employment status, marital status, and age. TBC includes more probability models (e.g., based on IQ for mothers and their children in the NLSY79), but I did not have time to reproduce and review them.
    Whatever my findings would be, I realized that I would need to explain why classification analysis is relevant to evaluating a probability model. Someone unfamiliar with regression analysis might respond: the classification results say nothing about the probability models because they describe different things. In the following sections, I discuss why:
  1. Classification results are needed to have a fair and human-centered evaluation since TBC’s public policy recommendations have large ramifications for individuals
  2. TBC’s probability models are most appropriately evaluated as predictive classifications in the context of statistical philosophy

Human-Centered Evaluation of TBC’s Probability Models

Extraordinary claims require extraordinary evidence. When a recommendation for public policy can negatively impact a group of people, material evidence is required to support the claim. With TBC, they cite their probability models as reason to accept their conclusions. In order to assess if these models are meaningful and reliable, they should be evaluated from the perspectives of those who could be harmed by the underlying claims. | This evaluation first requires that the potentially harmed population be well-defined. Revisiting TBC’s claims, the authors say (with my added emphasis):

The technically precise description of America’s fertility policy is that it subsidizes births among poor women, who are also disproportionately at the low end of the intelligence distribution.

By saying “disproportionately”, the authors imply that there is some IQ value below which the negative consequences of their probability models become most severe and worthy of political action. More specifically, if the authors are to objectively identify individuals they think have too low of IQ, they would need to state such an IQ value. And if their probability models are to be taken seriously, then the authors should use them to clarify what “disproportionate” means.

    Focusing on a binary target with a negative outcome like poverty, a “cutoff IQ” value could be defined as the IQ where the average of modeled probabilities across a dataset is, e.g., above 50%. Said more simply, individuals would be flagged as having too low of IQ when the probability model suggests they are “likely” to enter poverty, i.e., at least a 50% chance of occurring. It’s also important to realize that all the binary targets in TBC tend to have monotonic relationship with IQ, e.g., individuals with lower IQ tend to have greater observed frequencies of entering poverty in 1989. Combining this knowledge with the description of a “cutoff IQ” above, each TBC model could identify individuals impacted by the authors’ recommendations using either a cutoff value for IQ OR modeled probability, i.e., they are linked.
    Recalling Dr. Krenz’s analysis, a cutoff value used in classification acts in the same way as a modeled probability linked to a “cutoff IQ”. Hence, a classification analysis of TBC’s probability models is a human-centered evaluation that is necessary to balance the authors’ controversial claims.

Statistical Philosophy in The Bell Curve

While the prior section may sound trivial to a statistician or data scientist, my point is that fairness recommends a classification analysis. In this section, I go further by claiming that TBC’s probability models should only be evaluated as predictive classifications based on the statistical philosophy best describing TBC’s models.

    Dr. Galit Shmueli has written many publications about two common statistical philosophies: explanation (or sometimes called causality) and prediction6. Dr. Shmueli’s main premise is (with my added emphasis):

…[Explanation and prediction] are often conflated, yet the causal versus predictive distinction has a large impact on each step of the statistical modeling process and on its consequences. Although not explicitly stated in the statistics methodology literature, applied statisticians instinctively sense that predicting and explaining are different_.

At a high-level, explanatory models:

  • Describe a direct, causal mechanism among the model variables and the model target
  • Are retrospective by focusing on historical data
  • Are motivated by extensive theory
  • Prioritize the average case of the training data
  • Are evaluated more at the individual parameter level and often with the training data alone

Whereas predictive models:

  • Describe indirect associations among the model variables and the model target
  • Are prospective by focusing on future data
  • Are motivated more by observed data than by theory
  • Prioritize the individual case of the underlying data
  • Are evaluated more at the model level and often on one or more holdout datasets

How do these criteria play out in TBC? Quotes from the authors reflect their thoughts on explanatory modeling in TBC:

  1. The probability models are generally assumed to be explanatory in nature:

Part II was circumscribed, taking on social behaviors one at a time, focusing on causal roles…

If after looking at a variety of these other things which both theory and common sense say should have some bearing on whether a person ends up in poverty, but one ends up with a large, statistically independent role for I.Q., it seems to me to make a causal statement that I.Q. looks like its a cause of poverty, it is a reasonable thing to do.7

By the end of the chapter [on poverty], we will have drawn a controversial conclusion. How did we get there? What makes us think that we have got our causal ordering right? We will walk through the analyses that lie behind our conclusions…

  1. The authors focus on individual parameters, namely IQ, instead of the overall model (R2 in the quote below):

Even an inherently strong relationship can result in low values of R2 if the data points are bunched in various ways, and relatively noisy relationships can result in high values if the sample includes disproportionate numbers of outlier… We therefore consider the regression coefficients themselves (and their associated p values) to suit our analytic purposes better than R2, and that is why those are the ones we relied on in the text.

Note: The authors do have a point that R2 is not the most meaningful statistic for every model. However, they did not adequately motivate why probability models with rare events, as in the poverty model with an overall frequency of 7%, should ignore diagnostics at the model level.

Below, I comment on the role of predictive modeling in TBC:

  1. The binary targets are not materially causal because any proposed mechanism from IQ, SES, and Age is not direct enough. Focusing again on the poverty model, many explanations are possible as to why an individual would enter poverty. Saying IQ is causal of poverty would require linking IQ with the complex process that ends with the NLSY79 participant below the poverty line. That said, such explanations would certainly be in the realm of associations, because there’s considerable chance (i.e., noise) involved.

  2. As mentioned in the section introducing data used in TBC, most binary targets are defined to be measured nearly a decade in the future (relative to when IQ was measured). Undoubtedly, modeling such targets is a prospective process.

  3. The probability models in TBC should prioritize the individual. In the prior section, I explained why fairness recommends the individual-focused classification analysis. Also, TBC’s authors frequently comment on the importance of acknowledging individuals and not to be judgmental: “We cannot think of a legitimate argument why any encounter between individual whites and blacks need be affected by the knowledge that an agqregate ethnic difference in measured intelligence is genetic instead of environmental.”

Overall, I claim that TBC’s probability models are more predictive than explanatory in nature and so they should be evaluated as such.

Roadmap for Assessing Predictive Power

Many of the binary targets are uncommon to rare in occurrence. In logistic regression, this is called imbalanced data and can make some statistics unreliable. To counter this imbalance, I estimate Matthew’s Correlation Coefficient (MCC), which is the correlation between the binary target and the binary classification derived from a TBC probability model. MCC is robust to imbalanced data because it accounts for all possible matching scenarios8:

  • True Positive: classification expects the target event to occur and it did according to the binary target
  • False Positive: classification expects the target event to occur but it did not according to the binary target
  • True Negative: classification expects the target event not to occur and it did not
  • False Negative: classification expects the target event not to occur but it did
    As a correlation, MCC can be interpreted in the same way as a regular (Pearson) correlation. The following guideline seems reasonable in my statistical experience9:
Size of Correlation Interpretation
0.90 to 1.00 (−0.90 to −1.00) Very high positive (negative) correlation
0.70 to 0.90 (−0.70 to −0.90) High positive (negative) correlation
0.50 to 0.70 (−0.50 to −0.70) Moderate positive (negative) correlation
0.30 to 0.50 (−0.30 to −0.50) Low positive (negative) correlation
0.00 to 0.30 (0.00 to −0.30) negligible correlation
    Although MCC is appropriate for the binary targets in NLSY79, it does not fully measure predictive power by itself. To do so, I reproduce a TBC model, apply it to an unseen (i.e., holdout) sample, use a cutoff value to create classifications, and then evaluate MCC. (As a comparison, I also apply the TBC model to the training data.) When MCC is sufficiently above 0 for the holdout data (e.g., above 50% via the above table), it gives evidence that the model has good predictive power. When MCC is negative or far below 50%, the model has poor predictive power.
    Lastly, since the binary targets are noisy and the NLSY79 datasets are small, I repeat the latter predictive evaluation 10,000 times via bootstrap regression. The only change to the prior evaluation process is in slightly adjusting the modeling data by resampling rows (i.e., NLSY79 participants) from the original data (with replacement and ensuring the new sample has the same size as the original data).
    From the bootstrapped distribution of MCC values for each of the 24 probability models, I calculate an appropriate confidence interval around the original sample MCC value across a range of cutoff values. I then find the largest average bootstrapped MCC value (when evaluated on both the training and holdout data) and compare the optimal MCC values over each TBC model.

Data

Since reproducing TBC’s models is a goal for this project, I want the exact data used by the authors. With the NLSY79 being a public dataset, this is possible. However, the NLSY79 uses a special variable coding system which was not noted in TBC. To credit the authors, they did release their modified datasets online and included a data dictionary10. In order to reproduce TBC’s results as closely as possible, I opted to use the authors’ data (i.e., “Nation.txt”)

Training vs Holdout data

Another consideration is how the training and holdout datasets are defined. The authors’ data includes a variable called “Sample”. As described in the NLSY79 User’s Guide, this Sample variable has the following 3 values:

  1. A cross-sectional sample of 6,111 respondents designed to be representative of the noninstitutionalized civilian segment of young people living in the United States in 1979 and born between January 1, 1957, and December 31, 1964 (ages 14–21 as of December 31, 1978)
  1. A supplemental sample of 5,295 respondents designed to oversample civilian Hispanic, black, and economically disadvantaged non-black/non-Hispanic youth living in the United States during 1979 and born between January 1, 1957, and December 31, 1964
  1. A sample of 1,280 respondents designed to represent the population born between January 1, 1957, and December 31, 1961 (ages 17–21 as of December 31, 1978), and who were enlisted in one of the four branches of the military as of September 30, 1978

For a given TBC model to reproduce, the training data is based on the cross-sectional partition (which is exactly what TBC’s authors do). The holdout data is based on the supplemental partition. This selection for a holdout dataset is motivated from Dr. Krenz’s classification analysis, where she used the supplemental sample as a holdout.

Data and Model Specifications

Also needed for reproduction are the exact data and model specifications. The authors listed parameter and diagnostic output for each model in appendix 4 of TBC. I used these to experiment with the “Nation.txt”" dataset in order to reproduce TBC’s models as closely as possible. 23 of the 24 reproduced models agreed with the book’s listed parameter values up to 4 decimal places or more. The last model reproduced the parameter values to a similar precision but the model’s included categorical had different values. I suspect these differences come from the authors’ use of an older version of STATA, and myself using R. Perhaps the two modeling implementations handle categorical levels differently in the design matrix, causing different parameter values.

Below I list the shorthand names I give for each TBC model with their corresponding binary target:

setwd("B:/DATA 512/Final Project")
source('200 Code/000_Misc.R')
source('200 Code/400_Model_Definitions.R')
library(knitr)

kable(tbc.models[, .(label, target.description)])
label target.description
poverty Under the official poverty line in 1989
poverty.hs Under the official poverty line in 1989
dropout Permanently dropped out of high school
dropout_interaction Permanently dropped out of high school
get_ged Received a GED instead of a high school diploma
get_bachelors Received a bachelor’s degree
oolf4wks Out of the labor force for four weeks or more in 1989
oolf4wks.hs Out of the labor force for four weeks or more in 1989
oolf4wks.col Out of the labor force for four weeks or more in 1989
unemployed4wks Unemployed for four weeks or more in 1989
unemployed4wks.hs Unemployed for four weeks or more in 1989
unemployed4wks.col Unemployed for four weeks or more in 1989
ever_married30 Ever married before the age of 30
ever_married30.hs Ever married before the age of 30
ever_married30.col Ever married before the age of 30
divorced_in5yrs Divorced within the first 5 years of marriage
divorced_in5yrs.hs Divorced within the first 5 years of marriage
divorced_in5yrs.col Divorced within the first 5 years of marriage
divorced_in5yrs_parents Divorced within the first 5 years of marriage
surveyed_in_jail The subject was interviewed in jail at least once from 1979 to 1990
surveyed_in_jail.hs The subject was interviewed in jail at least once from 1979 to 1990
middle_class_values Did the subject score ‘yes’ on the Middle Class Values Index
middle_class_values.hs Did the subject score ‘yes’ on the Middle Class Values Index
middle_class_values.col Did the subject score ‘yes’ on the Middle Class Values Index

Next I list the training data specifications used for this project (where the holdout data has the same definition but with Sample == ‘Supplement’):

## Note: The following variables are not present in Nation.txt. I created these to facilitate reproducing TBC's models. See TBC_Bootstrap for details.
# NOTRUN
#   GEDvHSGr_Ind = ifelse(GEDvHSGr == 'GED', 1, 0)
#   LTHSvHS_Ind = ifelse(LTHSvHS == 'LTHS', 1, 0)
#   WedBy30_Ind = ifelse(WedBy30 == '1 Yes', 1, 0)
# NOTRUN

kable(tbc.models[, .(label, filter.training)])
label filter.training
poverty Race4 == ‘White’ & Sample == ‘XSection’ & EmpSchl == 1
poverty.hs Race4 == ‘White’ & Sample == ‘XSection’ & EmpSchl == 1 & EdSample == ‘HS’
dropout Race4 == ‘White’ & Sample == ‘XSection’ & LTHSvGED != ‘GED’ & !is.na(AllvBA)
dropout_interaction Race4 == ‘White’ & Sample == ‘XSection’ & LTHSvGED != ‘GED’ & !is.na(AllvBA)
get_ged Race4 == ‘White’ & Sample == ‘XSection’ & LTHSvGED != ‘LTHS’ & !is.na(AllvBA)
get_bachelors Race4 == ‘White’ & Sample == ‘XSection’ & !(BA_Atta == ‘InSchl/NoD’ & AllvBA == 0 & is.na(AllvBA))
oolf4wks Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’
oolf4wks.hs Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & EdSample == ‘HS’
oolf4wks.col Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & EdSample == ‘College’
unemployed4wks Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & !is.na(UnempSam)
unemployed4wks.hs Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & !is.na(UnempSam) & EdSample == ‘HS’
unemployed4wks.col Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Emp568) & Sex == ‘Male’ & !is.na(UnempSam) & EdSample == ‘College’
ever_married30 Race4 == ‘White’ & Sample == ‘XSection’ & IntAge90 >= 30 & !is.na(EverWed)
ever_married30.hs Race4 == ‘White’ & Sample == ‘XSection’ & IntAge90 >= 30 & !is.na(EverWed) & EdSample == ‘HS’
ever_married30.col Race4 == ‘White’ & Sample == ‘XSection’ & IntAge90 >= 30 & !is.na(EverWed) & EdSample == ‘College’
divorced_in5yrs Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Div5Yrs)
divorced_in5yrs.hs Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Div5Yrs) & EdSample == ‘HS’
divorced_in5yrs.col Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Div5Yrs) & EdSample == ‘College’
divorced_in5yrs_parents Race4 == ‘White’ & Sample == ‘XSection’ & !is.na(Div5Yrs) & Adult14S != ’’
surveyed_in_jail Race4 == ‘White’ & Sample == ‘XSection’ & Sex == ‘Male’
surveyed_in_jail.hs Race4 == ‘White’ & Sample == ‘XSection’ & Sex == ‘Male’ & EdSample == ‘HS’
middle_class_values Race4 == ‘White’ & Sample == ‘XSection’
middle_class_values.hs Race4 == ‘White’ & Sample == ‘XSection’ & EdSample == ‘HS’
middle_class_values.col Race4 == ‘White’ & Sample == ‘XSection’ & EdSample == ‘College’

Lastly, I list the model specifications used for this project:

kable(tbc.models[, .(label, formula)])
label formula
poverty Pov89 ~ zAFQT89 + zSES + zAge
poverty.hs Pov89 ~ zAFQT89 + zSES + zAge
dropout LTHSvHS_Ind ~ zAFQT89 + zSES + zAge
dropout_interaction LTHSvHS_Ind ~ zAFQT89 + zSES + zAge + zAFQT89:zSES
get_ged GEDvHSGr_Ind ~ zAFQT89 + zSES + zAge
get_bachelors AllvBA ~ zAFQT89 + zSES + zAge
oolf4wks MoOLF89 ~ zAFQT89 + zSES + zAge
oolf4wks.hs MoOLF89 ~ zAFQT89 + zSES + zAge
oolf4wks.col MoOLF89 ~ zAFQT89 + zSES + zAge
unemployed4wks MoUnemp8 ~ zAFQT89 + zSES + zAge
unemployed4wks.hs MoUnemp8 ~ zAFQT89 + zSES + zAge
unemployed4wks.col MoUnemp8 ~ zAFQT89 + zSES + zAge
ever_married30 WedBy30_Ind ~ zAFQT89 + zSES + zAge
ever_married30.hs WedBy30_Ind ~ zAFQT89 + zSES + zAge
ever_married30.col WedBy30_Ind ~ zAFQT89 + zSES + zAge
divorced_in5yrs Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate
divorced_in5yrs.hs Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate
divorced_in5yrs.col Div5Yrs ~ zAFQT89 + zSES + zAge + MarDate
divorced_in5yrs_parents Div5Yrs ~ zAFQT89 + zSES + zAge + Adult14S
surveyed_in_jail Jail ~ zAFQT89 + zSES + zAge
surveyed_in_jail.hs Jail ~ zAFQT89 + zSES + zAge
middle_class_values MCV_Inde ~ zAFQT89 + zSES + zAge
middle_class_values.hs MCV_Inde ~ zAFQT89 + zSES + zAge
middle_class_values.col MCV_Inde ~ zAFQT89 + zSES + zAge

Methods

Below I give more detail on how the MCC predictive evaluation is performed:

  1. Specify a training dataset and formula consistent with a TBC model.

  2. Fit 10,000 logistic regression models via bootstrapping (i.e., resampling the observed joint distribution with equal sizes)

  3. Apply each bootstrapped model to the training data.

  4. Map the modeled probabilities to a classification by selecting a cutoff value

  5. Build a confusion matrix with the modeled classifications and actual target data.

  6. Calculate various performance metrics for a binary target, in particular, MCC.

  7. Repeat steps 4 - 6 for a large range of possible cutoff values (i.e., a linear grid of length 500 between 0 and 1).

  8. Inspect the distribution of MCC across all cutoff values to find the maximum MCC value.

  9. With the IQ factor on the x-axis, plot the average observed target and bootstrapped classifications for a set of cutoff values (including that which maximizes MCC). Use the plot to build intuition on the maximum MCC value.

  10. Repeat all prior steps for each of the 24 training datasets and formulae.

  11. Summarize the maximum MCC values across all reproduced TBC models and by training and holdout data.

These steps are implemented in R using the following scripts within this repository:

  • 000_Misc.R: Attaches needed packages and defines a convenience function to report timing

  • 100_Resampling.R: Custom functions to perform resampling techniques via bootstrapping and the jackknife method

  • 200_Classification.R: Custom functions to create classifications from modeled probabilities and then quantify their performance

  • 300_Visualizations.R: Custom functions to create various visualizations to understand the predictive performance of binary classifications

  • 400_Model_Definitions.R: Parameters used to define and reproduce logistic regression models from The Bell Curve

  • TBC_Bootstrap.R: Recursive code that executes a batch run for each TBC probability model. This script is the workhorse of the entire project. See its code for details.

Note: When 10,000 bootstrap iterations are used on all 24 models in the prior script, the entire recursive batch process:

  • Produces 941 files in the “100 Data” folder, 6.76 GBs in total
  • Takes 4 - 5 hours

Results

In this section, I present two types of results. First, I walkthrough output from the poverty model. Output from the remaining models is displayed in the appendix. Second, I summarize the optimal MCC values as evaluated on the training and holdout datasets for each model.

Walkthrough of Poverty Model

Before describing output from the poverty model, I import all the saved plot images and other objects from the batch run (executed by “TBC_Bootstrap.R”).

library(cowplot)
# Note: To import svg images, cowplot requires the magick package but only states so when trying to import them

# Load data
save.folder <- '100 Data'

# Create more simpler version of model labels
model.labels <- tbc.models$label

# Get filenames of all data saved from the bootstrap analysis
filenames <- list.files(save.folder)

# For bootstrap plots and output files, create named lists via model labels that stores the relevant filenames
#   E.g., 
#   names(filenames.plots)[1] = 'poverty'
#   names(filenames.plots)[24] = 'middle_class_values.col'
filenames.plots <- vector(mode="list", length = length(model.labels))
names(filenames.plots) <- model.labels

# For each model label, store relevant filenames for either the bootstrap plots or output files
for(label in model.labels) {
  filenames.plots[[label]] <- grep(pattern = paste0('^',label,'_plot'), x = filenames, perl = TRUE, value = TRUE)
}

# For bootstrap plots and output files, create named lists via model labels that stores the relevant R objects
# Within each list index of plots, create named list to denote each type of R object for easier access in the report
#   E.g.,
#   names(plots)[1] = 'poverty'
#   names(plots[[1]])[1] <- plots.a2e
plot.type <- c(
  'plots.a2e_zAFQT89'
  ,'plots.a2e_zAge'
  ,'plots.a2e_zSES'
  ,'plots.hist.cutoffs.training_cutoff.2stdev'
  ,'plots.hist.cutoffs.training_cutoff.roc'
  ,'plots.hist.cutoffs.training_cutoff.mcc'
  ,'plots.hist.cutoffs.holdout_cutoff.2stdev'
  ,'plots.hist.cutoffs.holdout_cutoff.roc'
  ,'plots.hist.cutoffs.holdout_cutoff.mcc'
  ,'plots.class.cutoffs.training_cutoff.2stdev'
  ,'plots.class.cutoffs.training_cutoff.roc'
  ,'plots.class.cutoffs.training_cutoff.mcc'
  ,'plots.class.cutoffs.holdout_cutoff.2stdev'
  ,'plots.class.cutoffs.holdout_cutoff.roc'
  ,'plots.class.cutoffs.holdout_cutoff.mcc'
  ,'plots.classFormat.mcc.training_plot.a2e'
  ,'plots.classFormat.mcc.training_plot.volume'
  ,'plots.classFormat.training_plot.a2e'
  ,'plots.classFormat.training_plot.volume'
  ,'plots.classFormat.mcc.holdout_plot.a2e'
  ,'plots.classFormat.mcc.holdout_plot.volume'
  ,'plots.classFormat.holdout_plot.a2e'
  ,'plots.classFormat.holdout_plot.volume'
  ,'plot.dist.mcc.training'
  ,'plot.dist.mcc.holdout'
  ,'plot.roc.training'
  ,'plot.roc.holdout'
  ,'plot.dist.mcc.training.bootCI'
  ,'plot.dist.mcc.holdout.bootCI'
)
plots <- vector(mode="list", length = length(model.labels))
names(plots) <- model.labels

# Import each saved SVG file and store it as a named element within the plots list
for(label in model.labels) {
  plots[[label]] <- vector(mode="list", length = length(plot.type))
  names(plots[[label]]) <- plot.type
  for(type in plot.type) {
    filename.plot.type <- paste0(save.folder,'/',label,'_',type,'.svg')
    plots[[label]][[type]] <- cowplot::ggdraw() + cowplot::draw_image(magick::image_read_svg(filename.plot.type, width = 3000))
  }
}

# Import the outputs list for the poverty model
output.poverty <- data.table(t(readRDS('100 Data/poverty_outputs.rds')))

Since reproducibility is critical for this project, I display the model parameters associated with the original, non-bootstrapped poverty model. These values should correspond to the ones listed on P. 620 of TBC. (The reference TBC page for each model is in the tbc.models$appendix.page field as described in the “400_Model_Definitions.R” script)

summary(output.poverty[1L, glm.model][[1]])
## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2724  -0.4127  -0.3041  -0.2153   3.2655  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.64877    0.07688 -34.452  < 2e-16 ***
## zAFQT89     -0.83767    0.09351  -8.958  < 2e-16 ***
## zSES        -0.33008    0.09010  -3.663 0.000249 ***
## zAge        -0.02384    0.07237  -0.329 0.741863    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1750.7  on 3366  degrees of freedom
## Residual deviance: 1568.8  on 3363  degrees of freedom
## AIC: 1576.8
## 
## Number of Fisher Scoring iterations: 6

The 3 plots below show the average of the poverty target and the modeled probabilities without bootstrapping across a banded version of the 3 main variables: zAFQT89, zSES, and zAge. Both zAFQT89 and zSES tend to have a monotonically decreasing relationship with the poverty target. The plot with zAFQT89 is the least noisy of the 3 and that of zAge is the most noisy. By inspection, these plots show that zAFQT89 and zSES influence the target, but do not capture a significant amount of variance in the data.

Note: I am having plotting issues in this version of the report, as shown by the excessive whitespace around plots from imported SVG files.

plots[['poverty']][['plots.a2e_zAFQT89']]
plots[['poverty']][['plots.a2e_zSES']]
plots[['poverty']][['plots.a2e_zAge']]

The 2 plots below show the bootstrapped distributions of MCC values across a dense grid of cutoff values. The first plot is the MCC distribution when evaluated on the training data, the second is that using the holdout data. The optimal MCC value is displayed above each plot.

The solid orange line represents the original sample MCC values for each cutoff. The shaded region around this line is the accelerated, bias-corrected 95% percentile intervals for the sample MCC values. The dotted orange line represents the average bootstrapped MCC value by cutoff. The blue hexagonal tiles represent a density plot of bootstrapped MCC values for that local region. Higher density tiles have a lighter color of blue.

cat(paste0('The optimal MCC value using training data = ',output.poverty[1L, mcc.optimal.training][[1]]))
## The optimal MCC value using training data = 0.225068316915218
cat(paste0('The cutoff value associated with the optimal MCC value from training data = ',output.poverty[1L, cutoffs][[1]]['cutoff.mcc']))
## The cutoff value associated with the optimal MCC value from training data = 0.15
plots[['poverty']][['plot.dist.mcc.training.bootCI']]

cat(paste0('The optimal MCC value using holdout data = ',output.poverty[1L, mcc.optimal.holdout][[1]]))
## The optimal MCC value using holdout data = 0.337074404515184
plots[['poverty']][['plot.dist.mcc.holdout.bootCI']]

The 2 plots below show the averages of the poverty target, modeled probabilities, and the classifications using the cutoff value associated with the optimal MCC value from the training data vs a banded version of zAFQT89. The first plot shows results using the training data, the second is that using the holdout data. Overall, the average classifications poorly reflect the observed values for both datasets, which is consistent with optimal MCC values below 35%.

plots[['poverty']][['plots.classFormat.mcc.training_plot.a2e']]
plots[['poverty']][['plots.classFormat.mcc.holdout_plot.a2e']]

Comparison of Optimal MCC Values

The main output of this project is the following plot, which compares the optimal MCC values between the training and holdout datasets for each model. The x-axis represents the optimal MCC values as evaluated on the training data, the y-axis is that evaluated on the holdout data. To simplify the plot, the models are grouped together by category. With the exception of a few education-related TBC models, all other TBC models have optimal MCC values below 50% for either training or holdout data.

# Import all outputs
library(RColorBrewer)

save.folder <- '100 Data'

model.labels <- tbc.models$label
filenames.output <- paste0(save.folder, '/', model.labels, '_outputs.rds')
outputs <- rbindlist(lapply(seq_along(filenames.output), function(x) data.table(t(readRDS(filenames.output[x])))))
outputs[, label := model.labels]

categories <- c(
  'Poverty'
  ,'Poverty'
  ,'Education'
  ,'Education'
  ,'Education'
  ,'Education'
  ,'Employment'
  ,'Employment'
  ,'Employment'
  ,'Employment'
  ,'Employment'
  ,'Employment'
  ,'Marriage'
  ,'Marriage'
  ,'Marriage'
  ,'Marriage'
  ,'Marriage'
  ,'Marriage'
  ,'Marriage'
  ,'Crime'
  ,'Crime'
  ,shQuote('Middle Class Values')
  ,shQuote('Middle Class Values')
  ,shQuote('Middle Class Values')
)

# Create category field
outputs[, Category := factor(categories, levels = unique(categories))]

# Ensure MCC fields are numeric
outputs[, mcc.optimal.training := as.numeric(mcc.optimal.training)]
outputs[, mcc.optimal.holdout := as.numeric(mcc.optimal.holdout)]

# Plot
ggplot(
  outputs[, .(mcc.optimal.training, mcc.optimal.holdout, Category)]
  , aes(x = mcc.optimal.training, y = mcc.optimal.holdout, group = Category, color = Category, shape = Category)
) + 
  geom_point(size = 4) + 
  scale_color_brewer(palette = "Dark2") +
  xlab('Optimal MCC Value (Training Data)') + 
  ylab('Optimal MCC Value (Holdout Data)') + 
  lims(x = c(0,1), y = c(0,1)) + 
  theme_bw() + 
  theme(
    axis.text=element_text(size=16)
    , axis.title=element_text(size=16)
    , legend.text=element_text(size=16)
    , legend.title=element_text(size=16)
  )

Discussion

What does the plot comparing optimal MCC values mean? Using the guidelines for interpreting MCC values listed above, nearly all TBC models have a low correlation (i.e., below 50%) between their observed binary target and the bootstrapped classification when evaluated on either training or holdout data. This suggests TBC’s probability models are not materially predictive and so they inadequately support the authors’ public policy recommendations.

The only TBC models with optimal MCC values above 50% are those with binary targets related to education: receiving a GED, receiving a Bachelor’s, or dropping out of school. Again using the above guidelines for interpreting MCC values, the optimal MCC values for these models are “moderate.” While these models are relatively more predictive than the other TBC models, they alone do not warrant the authors’ recommendations.

Appendix: Model Output

Note: These really should be generated by a simple function, but I’ve run out of time…

Poverty - Highschool

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1980  -0.4354  -0.3543  -0.2759   2.9017  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.7238     0.1290 -21.111  < 2e-16 ***
## zAFQT89      -0.8267     0.1627  -5.080 3.77e-07 ***
## zSES         -0.3620     0.1500  -2.413   0.0158 *  
## zAge          0.1049     0.1095   0.959   0.3378    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 694.58  on 1235  degrees of freedom
## Residual deviance: 650.54  on 1232  degrees of freedom
## AIC: 658.54
## 
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.17343841842523
## The cutoff value associated with the optimal MCC value from training data = 0.08
## The optimal MCC value using holdout data = 0.312399884345027

Dropout

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.60536  -0.38341  -0.20105  -0.09667   3.06586  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.85323    0.09395 -30.368  < 2e-16 ***
## zAFQT89     -1.72296    0.10281 -16.759  < 2e-16 ***
## zSES        -0.64776    0.08966  -7.225 5.03e-13 ***
## zAge         0.05696    0.06883   0.828    0.408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2347.8  on 3571  degrees of freedom
## Residual deviance: 1560.0  on 3568  degrees of freedom
## AIC: 1568
## 
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.494010257784765
## The cutoff value associated with the optimal MCC value from training data = 0.2
## The optimal MCC value using holdout data = 0.601408676867907

Dropout - Interaction

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4996  -0.3901  -0.1779  -0.0617   3.2502  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.91432    0.10293 -28.313  < 2e-16 ***
## zAFQT89      -1.89377    0.11884 -15.936  < 2e-16 ***
## zSES         -0.94024    0.12505  -7.519 5.53e-14 ***
## zAge          0.05227    0.06827   0.766 0.443944    
## zAFQT89:zSES -0.41332    0.11878  -3.480 0.000502 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2347.8  on 3571  degrees of freedom
## Residual deviance: 1547.8  on 3567  degrees of freedom
## AIC: 1557.8
## 
## Number of Fisher Scoring iterations: 7
## The optimal MCC value using training data = 0.492245432165761
## The cutoff value associated with the optimal MCC value from training data = 0.21
## The optimal MCC value using holdout data = 0.594994937920605

Receive GED

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2512  -0.4473  -0.3507  -0.2584   2.7730  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.35485    0.06539 -36.014  < 2e-16 ***
## zAFQT89     -0.43253    0.08512  -5.081 3.75e-07 ***
## zSES        -0.60822    0.08375  -7.262 3.81e-13 ***
## zAge        -0.04164    0.06624  -0.629     0.53    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1974.7  on 3493  degrees of freedom
## Residual deviance: 1830.6  on 3490  degrees of freedom
## AIC: 1838.6
## 
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.174449610068079
## The cutoff value associated with the optimal MCC value from training data = 0.06
## The optimal MCC value using holdout data = 0.285684510684511

Receive Bachelor’s

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6986  -0.5662  -0.2698   0.1912   3.6879  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.41463    0.07783 -31.026  < 2e-16 ***
## zAFQT89      1.77473    0.07825  22.680  < 2e-16 ***
## zSES         1.01514    0.06784  14.963  < 2e-16 ***
## zAge        -0.30316    0.05113  -5.929 3.05e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4365.6  on 3856  degrees of freedom
## Residual deviance: 2778.9  on 3853  degrees of freedom
## AIC: 2786.9
## 
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.55399479297447
## The cutoff value associated with the optimal MCC value from training data = 0.26
## The optimal MCC value using holdout data = 0.646041223438848

Out of Labor Force

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7781  -0.5032  -0.4407  -0.3839   2.5344  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.20264    0.08680 -25.376  < 2e-16 ***
## zAFQT89     -0.36247    0.09928  -3.651 0.000261 ***
## zSES         0.21788    0.10757   2.025 0.042820 *  
## zAge        -0.12815    0.08640  -1.483 0.138013    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1115.4  on 1685  degrees of freedom
## Residual deviance: 1096.5  on 1682  degrees of freedom
## AIC: 1104.5
## 
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.0924794744495883
## The cutoff value associated with the optimal MCC value from training data = 0.1
## The optimal MCC value using holdout data = 0.144380923743628

Out of Labor Force - Highschool

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7061  -0.4238  -0.3623  -0.3052   2.6075  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.6978     0.1767 -15.264   <2e-16 ***
## zAFQT89      -0.4215     0.2264  -1.862   0.0627 .  
## zSES          0.5649     0.2300   2.456   0.0141 *  
## zAge         -0.1456     0.1673  -0.870   0.3841    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 322.88  on 620  degrees of freedom
## Residual deviance: 313.96  on 617  degrees of freedom
## AIC: 321.96
## 
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.0905788678471054
## The cutoff value associated with the optimal MCC value from training data = 0.09
## The optimal MCC value using holdout data = 0.0854278704792859

Out of Labor Force - College

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0872  -0.3847  -0.2834  -0.1921   2.9217  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.1296     0.6082  -5.146 2.66e-07 ***
## zAFQT89      -0.8432     0.4527  -1.863   0.0625 .  
## zSES          0.9451     0.3875   2.439   0.0147 *  
## zAge         -0.4606     0.2990  -1.540   0.1235    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 126.66  on 267  degrees of freedom
## Residual deviance: 113.07  on 264  degrees of freedom
## AIC: 121.07
## 
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.251892900905496
## The cutoff value associated with the optimal MCC value from training data = 0.18
## The optimal MCC value using holdout data = 0.0125472709709175

Unemployed For At Least 4 Weeks

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7332  -0.4182  -0.3531  -0.2979   2.6654  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.53577    0.10760 -23.566  < 2e-16 ***
## zAFQT89     -0.49487    0.12989  -3.810 0.000139 ***
## zSES        -0.02535    0.13838  -0.183 0.854661    
## zAge        -0.02181    0.11084  -0.197 0.843970    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 720.05  on 1396  degrees of freedom
## Residual deviance: 697.43  on 1393  degrees of freedom
## AIC: 705.43
## 
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.13636341913744
## The cutoff value associated with the optimal MCC value from training data = 0.54
## The optimal MCC value using holdout data = 0.167676842722803

Unemployed For At Least 4 Weeks - Highschool

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5698  -0.4232  -0.3734  -0.3369   2.4738  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.5988     0.1766 -14.715   <2e-16 ***
## zAFQT89      -0.3935     0.2369  -1.661   0.0966 .  
## zSES          0.1395     0.2353   0.593   0.5532    
## zAge         -0.1051     0.1762  -0.596   0.5509    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.71  on 536  degrees of freedom
## Residual deviance: 280.98  on 533  degrees of freedom
## AIC: 288.98
## 
## Number of Fisher Scoring iterations: 5
## The optimal MCC value using training data = 0.152252980756548
## The cutoff value associated with the optimal MCC value from training data = 0.55
## The optimal MCC value using holdout data = 0.0934865768327446

Unemployed For At Least 4 Weeks - College

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7832  -0.3498  -0.2655  -0.1831   2.9024  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.1687     0.7277  -4.355 1.33e-05 ***
## zAFQT89      -0.9197     0.5641  -1.630   0.1031    
## zSES          1.0039     0.5016   2.002   0.0453 *  
## zAge          0.2942     0.3311   0.889   0.3743    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 88.152  on 227  degrees of freedom
## Residual deviance: 81.012  on 224  degrees of freedom
## AIC: 89.012
## 
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.294795327544672
## The cutoff value associated with the optimal MCC value from training data = 0.98
## The optimal MCC value using holdout data = 0.216322010592139

Ever Get Married

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0635   0.5747   0.6645   0.7173   0.9074  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.19841    0.12890   9.297   <2e-16 ***
## zAFQT89     -0.04736    0.07579  -0.625   0.5320    
## zSES        -0.19055    0.07863  -2.423   0.0154 *  
## zAge         0.20403    0.12905   1.581   0.1139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1692.4  on 1633  degrees of freedom
## Residual deviance: 1679.5  on 1630  degrees of freedom
## AIC: 1687.5
## 
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0615536629052357
## The cutoff value associated with the optimal MCC value from training data = 0.81
## The optimal MCC value using holdout data = 0.0448060684227072

Ever Get Married - Highschool

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2489   0.4522   0.5444   0.6280   0.9513  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.4149     0.2343   6.040 1.54e-09 ***
## zAFQT89       0.5142     0.1598   3.217  0.00129 ** 
## zSES         -0.1129     0.1583  -0.713  0.47572    
## zAge          0.3683     0.2423   1.520  0.12846    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 532.66  on 604  degrees of freedom
## Residual deviance: 518.81  on 601  degrees of freedom
## AIC: 526.81
## 
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.169164163883082
## The cutoff value associated with the optimal MCC value from training data = 0.71
## The optimal MCC value using holdout data = 0.0359356811522763

Ever Get Married - College

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6383  -1.5055   0.8298   0.8615   0.9549  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.71372    0.39462   1.809   0.0705 .
## zAFQT89      0.05014    0.22375   0.224   0.8227  
## zSES         0.09683    0.18337   0.528   0.5975  
## zAge        -0.01778    0.29509  -0.060   0.9520  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 291.06  on 236  degrees of freedom
## Residual deviance: 290.71  on 233  degrees of freedom
## AIC: 298.71
## 
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0429999820833445
## The cutoff value associated with the optimal MCC value from training data = 0.96
## The optimal MCC value using holdout data = 0.0659401887074127

Divorced in 1st 5 Years of Marriage

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1265  -0.7064  -0.6137  -0.4877   2.2772  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.70861    1.98581   2.875 0.004044 ** 
## zAFQT89     -0.35734    0.07813  -4.574 4.79e-06 ***
## zSES         0.22195    0.07876   2.818 0.004831 ** 
## zAge        -0.17767    0.07415  -2.396 0.016568 *  
## MarDate     -0.08677    0.02431  -3.569 0.000358 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2026.5  on 2030  degrees of freedom
## Residual deviance: 1982.7  on 2026  degrees of freedom
## AIC: 1992.7
## 
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.116036786658606
## The cutoff value associated with the optimal MCC value from training data = 0.19
## The optimal MCC value using holdout data = 0.13753196318916

Divorced in 1st 5 Years of Marriage - Highschool

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8893  -0.6880  -0.6325  -0.5499   2.1149  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  5.44514    3.12869   1.740   0.0818 .
## zAFQT89     -0.03792    0.13481  -0.281   0.7785  
## zSES         0.22069    0.12882   1.713   0.0867 .
## zAge        -0.10781    0.11468  -0.940   0.3472  
## MarDate     -0.08400    0.03832  -2.192   0.0284 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 864.50  on 874  degrees of freedom
## Residual deviance: 857.41  on 870  degrees of freedom
## AIC: 867.41
## 
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0539839906990826
## The cutoff value associated with the optimal MCC value from training data = 0.19
## The optimal MCC value using holdout data = 0.0704713471092265

Divorced in 1st 5 Years of Marriage - College

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0638  -0.4025  -0.3032  -0.2108   2.8276  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 32.39423   13.50911   2.398   0.0165 *
## zAFQT89     -0.75623    0.45023  -1.680   0.0930 .
## zSES        -0.07353    0.35889  -0.205   0.8377  
## zAge        -0.55879    0.40470  -1.381   0.1674  
## MarDate     -0.41115    0.16298  -2.523   0.0116 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 107.925  on 208  degrees of freedom
## Residual deviance:  96.829  on 204  degrees of freedom
## AIC: 106.83
## 
## Number of Fisher Scoring iterations: 6
## The optimal MCC value using training data = 0.203076829254132
## The cutoff value associated with the optimal MCC value from training data = 0.07
## The optimal MCC value using holdout data = 0.339836917410076

Divorced in 1st 5 Years of Marriage - Parents Status

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0071  -0.6985  -0.6214  -0.5131   2.2298  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.12299    0.19200  -5.849 4.95e-09 ***
## zAFQT89          -0.39256    0.07742  -5.070 3.97e-07 ***
## zSES              0.19104    0.07833   2.439   0.0147 *  
## zAge             -0.02781    0.06177  -0.450   0.6526    
## Adult14S2 Bio    -0.31954    0.20437  -1.564   0.1179    
## Adult14SBio/Step -0.11008    0.25682  -0.429   0.6682    
## Adult14SOther    -0.25948    0.37893  -0.685   0.4935    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2022.8  on 2028  degrees of freedom
## Residual deviance: 1989.4  on 2022  degrees of freedom
## AIC: 2003.4
## 
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0940162389447624
## The cutoff value associated with the optimal MCC value from training data = 0.19
## The optimal MCC value using holdout data = 0.111113301248484

Ever Surveyed in Jail

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8179  -0.2506  -0.1819  -0.1337   3.4623  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.7772     0.1718 -21.986  < 2e-16 ***
## zAFQT89      -0.8967     0.1754  -5.113 3.17e-07 ***
## zSES         -0.1555     0.1806  -0.861    0.389    
## zAge          0.0783     0.1469   0.533    0.594    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 486.43  on 1944  degrees of freedom
## Residual deviance: 439.80  on 1941  degrees of freedom
## AIC: 447.8
## 
## Number of Fisher Scoring iterations: 7
## The optimal MCC value using training data = 0.150823426510628
## The cutoff value associated with the optimal MCC value from training data = 0.02
## The optimal MCC value using holdout data = 0.213056160247416

Ever Surveyed in Jail - Highschool

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6631  -0.1559  -0.1172  -0.0893   3.2606  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.9658     0.4806 -10.332   <2e-16 ***
## zAFQT89      -1.0701     0.4431  -2.415   0.0157 *  
## zSES         -0.1621     0.4643  -0.349   0.7270    
## zAge          0.4673     0.3678   1.271   0.2039    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 87.818  on 715  degrees of freedom
## Residual deviance: 79.701  on 712  degrees of freedom
## AIC: 87.701
## 
## Number of Fisher Scoring iterations: 8
## The optimal MCC value using training data = 0.255071154213153
## The cutoff value associated with the optimal MCC value from training data = 0.99
## The optimal MCC value using holdout data = 0.0374968148960129

Yes to “Middle Class Values”

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0494  -1.0869   0.5879   1.0762   2.1285  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.063853   0.038934  -1.640    0.101    
## zAFQT89      0.632507   0.052818  11.975  < 2e-16 ***
## zSES         0.244956   0.052062   4.705 2.54e-06 ***
## zAge         0.006637   0.040193   0.165    0.869    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4198.3  on 3028  degrees of freedom
## Residual deviance: 3874.9  on 3025  degrees of freedom
## AIC: 3882.9
## 
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.27127871233613
## The cutoff value associated with the optimal MCC value from training data = 0.43
## The optimal MCC value using holdout data = 0.450510368706667

Yes to “Middle Class Values” - Highschool

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6090  -1.3136   0.9618   1.0268   1.2166  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.39448    0.06118   6.448 1.14e-10 ***
## zAFQT89      0.16815    0.09312   1.806   0.0710 .  
## zSES        -0.17993    0.09034  -1.992   0.0464 *  
## zAge         0.01888    0.06218   0.304   0.7614    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1569.7  on 1161  degrees of freedom
## Residual deviance: 1563.7  on 1158  degrees of freedom
## AIC: 1571.7
## 
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.0588466467419308
## The cutoff value associated with the optimal MCC value from training data = 0.57
## The optimal MCC value using holdout data = 0.0577337608415983

Yes to “Middle Class Values” - College

## 
## Call:
## glm(formula = formula.model, family = binomial(link = "logit"), 
##     data = DT.model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0979   0.5413   0.6295   0.7029   1.0368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.99516    0.23868   4.169 3.05e-05 ***
## zAFQT89      0.39251    0.19881   1.974   0.0483 *  
## zSES         0.03692    0.16858   0.219   0.8266    
## zAge         0.13876    0.13364   1.038   0.2991    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 406.72  on 401  degrees of freedom
## Residual deviance: 400.18  on 398  degrees of freedom
## AIC: 408.18
## 
## Number of Fisher Scoring iterations: 4
## The optimal MCC value using training data = 0.130393984723014
## The cutoff value associated with the optimal MCC value from training data = 0.73
## The optimal MCC value using holdout data = 0.0679175604648386

References


  1. Herrnstein and Murray. (1994). The Bell Curve. New York: The Free Press.

  2. Bureau of Labor Statistics, U.S. Department of Labor. National Longitudinal Survey of Youth 1979 cohort, 1979-2012 (rounds 1-25). Produced and distributed by the Center for Human Resource Research, The Ohio State University. Columbus, OH: 2014.

  3. prepared for the U.S. Department of Labor by Center for Human Resource Research, The Ohio State University. (2001). NLSY79 users’ guide : a guide to the 1979-2000 National Longitudinal Survey of Youth data. Columbus, Ohio :Center for Human Resource Research, Ohio State University.

  4. Heckman, J. (1995). Lessons from The Bell Curve. Journal of Political Economy, 103(5), 1091–1120.

  5. Krenz, C. (2007, August 8). Anatomy of an Analysis. Retrieved from http://www.claudiax.net/bell.html.

  6. Shmueli, G. (2010). To Explain or To Predict? Statistical Science, 25(3), 289–310.

  7. Miele, F. (1995). For Whom The Bell Curve Tolls. Skeptic, Volume 3, #3, 34 – 41.

  8. Boughorbel S, Jarray F, El-Anbari M (2017) Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 12(6).

  9. Mukaka MM. Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Med J. 2012;24(3):69-71.

  10. Herrnstein and Murray (1994). Nation.txt, Nation.hdr, and 2TBC_Documentation.ascii. Retrieved from http://www.rasmusen.org/xpacioli/bellcurve.